This Rmarkdown documents contains the code and 2D simulations of two generic classes with linear separation. The
windowsFonts(A = windowsFont("DejaVu Sans"))
palette("Tableau10")
library(caret)
library(kernlab)
library(MASS)
library(dplyr)
library(knitr)
library(kableExtra)
source("predict_with_uncertainty_parallel_v2.R")
source("predict_noise_confidence_interval.R")
source("classify_sample_with_limits.R")
source("plot_functions.R")
set.seed(1234)
n_points <- 10 # Number of points per class
n_replicates <- 5 # Number of replicates
noise_sd <- 0.1 # Standard deviation
rho <- 0 # Correlation coefficient
# Define the covariance matrix
cov_matrix <- matrix(c(noise_sd^2, rho * noise_sd^2,
rho * noise_sd^2, noise_sd^2), nrow = 2)
# Function to generate 2D points with correlated noise
generate_class <- function(n_points, n_replicates, center_x, center_y, cov_matrix) {
points <- matrix(0, nrow = n_points * n_replicates, ncol = 2)
for (i in 1:n_points) {
base_point <- c(center_x + rnorm(1, 0, 0.2), center_y + rnorm(1, 0, 0.2))
for (j in 1:n_replicates) {
noise <- mvrnorm(1, mu = c(0, 0), Sigma = cov_matrix) # Correlated noise
points[(i - 1) * n_replicates + j, ] <- base_point + noise
}
}
return(points)
}
class1 <- generate_class(n_points, n_replicates, center_x = 0.8, center_y = 1.6, cov_matrix)
class2 <- generate_class(n_points, n_replicates, center_x = 1.3, center_y = 1.2, cov_matrix)
data <- data.frame(
Class = factor(rep(c("Class 1", "Class 2"), each = n_points * n_replicates)),
Sample_ID = rep(1:(2 * n_points), each = n_replicates),
X1 = c(class1[, 1], class2[, 1]),
X2 = c(class1[, 2], class2[, 2])
)
# Mean center the data
data$X1 <- data$X1 - mean(data$X1)
data$X2 <- data$X2 - mean(data$X2)
# Calculate the mean and standard deviation of each sample
mean_data <- aggregate(cbind(X1, X2) ~ Class + Sample_ID, data = data, FUN = mean)
sd_data <- aggregate(cbind(X1, X2) ~ Class + Sample_ID, data = data, FUN = sd)
rm(class1, class2)
pls_model <- train(y = data$Class,
x = data[,3:4],
method = "pls")
The predict_with_uncertainty_parallel() function is used
to estimate the uncertainty of each sample.
loc_pred <- predict_with_uncertainty_parallel(data[,-1],
pls_model,
n_sim = 500,
return_simulations = T)
This function returns the mean and standard deviation of the \(\hat{y}\) of each sample based on the Monte Carlo simulation. It also returns the area above and bellow the decision boundary (in this case, zero) of each distribution. The classification output indicates if the sample is being classified with certainty or not. For samples where the distribution crosses the decision boundary (considering the 95% confidence level), the sample will be assigned as uncertain. Once ’ return_simulations = T’, the augmented noise will be saved. The results for this 2D simulated data is displayed in
| class | sample | y_mean | y_sd | area_above_0 | area_below_0 | classification |
|---|---|---|---|---|---|---|
| Class 1 | 1 | 0.4786 | 0.1032 | 1.000 | 0.000 | Class 1 |
| Class 1 | 2 | 0.4537 | 0.1195 | 1.000 | 0.000 | Class 1 |
| Class 1 | 3 | 0.0986 | 0.0963 | 0.847 | 0.153 | Uncertain |
| Class 1 | 4 | 0.4058 | 0.0813 | 1.000 | 0.000 | Class 1 |
| Class 1 | 5 | 0.2430 | 0.1179 | 0.980 | 0.020 | Class 1 |
| Class 1 | 6 | 0.5741 | 0.1117 | 1.000 | 0.000 | Class 1 |
| Class 1 | 7 | 0.4035 | 0.1193 | 1.000 | 0.000 | Class 1 |
| Class 1 | 8 | 0.2370 | 0.1309 | 0.965 | 0.035 | Uncertain |
| Class 1 | 9 | 0.6676 | 0.1423 | 1.000 | 0.000 | Class 1 |
| Class 1 | 10 | 0.1896 | 0.0734 | 0.995 | 0.005 | Class 1 |
| Class 2 | 1 | -0.5636 | 0.1685 | 0.000 | 1.000 | Class 2 |
| Class 2 | 2 | -0.2418 | 0.1437 | 0.046 | 0.954 | Uncertain |
| Class 2 | 3 | -0.5007 | 0.0723 | 0.000 | 1.000 | Class 2 |
| Class 2 | 4 | -0.6533 | 0.1070 | 0.000 | 1.000 | Class 2 |
| Class 2 | 5 | -0.2090 | 0.1529 | 0.086 | 0.914 | Uncertain |
| Class 2 | 6 | 0.0334 | 0.2122 | 0.562 | 0.438 | Uncertain |
| Class 2 | 7 | -0.2577 | 0.1542 | 0.047 | 0.953 | Uncertain |
| Class 2 | 8 | -0.4661 | 0.1486 | 0.001 | 0.999 | Class 2 |
| Class 2 | 9 | -0.4444 | 0.1509 | 0.002 | 0.998 | Class 2 |
| Class 2 | 10 | -0.4536 | 0.1204 | 0.000 | 1.000 | Class 2 |
These results can be visualized, as showed in Figure 1.
Figure 1. Uncertainty of the classification outputs for the PLS-DA model. (a) 2D data simulation. Circles represents the replicates, while the triangles represents the mean of replicates. (b) Augmented noise generated through eigen decomposition of the original covariance matrix for each sample. 50 points are showed for each sample. (c) Estimated uncertainty for each sample.
It the model is linear and the covariances matrices are similar across all the samples in the data set, the uncertainty estimation can be estimated at once. In that case, instead of verifying if the uncertainty of a sample is distributed across the decision boundary, the uncertainty is placed on the decision boundary and it is verified if a future prediction is within the decision boundary confidence intervals. In that case, the ‘predict_noise_confidence_interval’ is used to estimate the limits of the decision boundary with 95% of confidence.
global_limits <- predict_noise_confidence_interval(pls_model,
data$Sample_ID,
n_sim = 500,
return_simulations = T)
data.frame(global_limits$lower,
global_limits$upper)
## global_limits.lower global_limits.upper
## 1 -0.2283833 0.2283833
Then, the classification with confidence for future samples can be calculated using the ‘classify_sample_with_limits()’ function.
global_pred <- classify_sample_with_limits(mean_data[,3:4],
pls_model,
global_limits$upper)
| class | sample | y_mean | classification |
|---|---|---|---|
| Class 1 | 1 | 0.4732621 | Class 1 |
| Class 1 | 2 | 0.4588093 | Class 1 |
| Class 1 | 3 | 0.1024438 | Uncertain |
| Class 1 | 4 | 0.4031910 | Class 1 |
| Class 1 | 5 | 0.2484593 | Class 1 |
| Class 1 | 6 | 0.5794556 | Class 1 |
| Class 1 | 7 | 0.4073326 | Class 1 |
| Class 1 | 8 | 0.2327395 | Class 1 |
| Class 1 | 9 | 0.6634845 | Class 1 |
| Class 1 | 10 | 0.1912483 | Uncertain |
| Class 2 | 1 | -0.5621914 | Class 2 |
| Class 2 | 2 | -0.2395806 | Class 2 |
| Class 2 | 3 | -0.4979997 | Class 2 |
| Class 2 | 4 | -0.6519285 | Class 2 |
| Class 2 | 5 | -0.2083974 | Uncertain |
| Class 2 | 6 | 0.0288956 | Uncertain |
| Class 2 | 7 | -0.2678114 | Class 2 |
| Class 2 | 8 | -0.4615743 | Class 2 |
| Class 2 | 9 | -0.4522474 | Class 2 |
| Class 2 | 10 | -0.4475907 | Class 2 |
These results can also be visualized as showed in Figure 2.
Figure 1. Uncertainty of the classification outputs for the PLS-DA model. (a) 2D data simulation. Circles represents the replicates, while the triangles represents the mean of replicates. (b) Augmented noise generated through eigen decomposition of the original covariance matrix for all the samples. 50 points are showed for each sample. (c) Estimated uncertainty for the model.